We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation. Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary in a mode-seeking behavior. By distilling a 3D consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods, in both distortion and perception metrics, for sparse-view novel view synthesis.
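As a rough, hedged sketch of the distillation idea (the toy denoiser and the flat tensor standing in for a rendering are assumptions, not the paper's view-conditioned latent diffusion model or renderer), a score-distillation-style update noises the current rendering, lets the diffusion model denoise it, and uses the residual as a gradient that pulls the rendering toward a mode of the image distribution:

```python
import torch
import torch.nn as nn

# Toy stand-in for a diffusion denoiser (hypothetical; the real method uses a
# view-conditioned latent diffusion model).
class ToyDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Linear(dim, dim)

    def forward(self, x_noisy):
        return self.net(x_noisy)  # predicted noise

def distillation_grad(render, denoiser, alpha_bar):
    """Noise the rendering, denoise it with the diffusion model, and return
    the residual: a gradient that nudges the rendering toward a mode."""
    noise = torch.randn_like(render)
    x_noisy = alpha_bar.sqrt() * render + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = denoiser(x_noisy)
    return eps_pred - noise

denoiser = ToyDenoiser()
render = torch.randn(1, 64, requires_grad=True)  # stand-in for a rendered view
grad = distillation_grad(render, denoiser, alpha_bar=torch.tensor(0.7))
render.backward(gradient=grad)  # in practice this flows into the 3D scene parameters
```

Repeating such updates across sampled viewpoints is what lets a single 3D representation absorb the generative model's predictions while staying geometrically consistent.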
We describe a data-driven method for inferring camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and is also a crucial preprocessing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods, which perform poorly given sparse views, we propose a top-down prediction-based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, which allows us to explicitly represent the multiple camera modes induced by object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images, on both seen and unseen categories. Furthermore, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. A project page with code and videos can be found at https://jasonyzhang.com/relpose.
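To make the energy-based formulation concrete, here is a minimal sketch, assuming a generic MLP energy network and random rotation candidates (both illustrative, not the paper's architecture): candidate relative rotations are scored jointly with the two images' features, and a softmax over negative energies yields a distribution that can retain several modes, e.g. those induced by object symmetry.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scores a candidate relative rotation against a pair of image features."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * 2 + 9, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, f1, f2, R):
        n = R.shape[0]  # R: (n, 3, 3) candidate rotations, flattened to 9-D
        x = torch.cat([f1.expand(n, -1), f2.expand(n, -1), R.reshape(n, 9)], dim=-1)
        return self.mlp(x).squeeze(-1)  # unnormalized energies

f1, f2 = torch.randn(1, 128), torch.randn(1, 128)       # per-image features
candidates = torch.linalg.qr(torch.randn(512, 3, 3)).Q  # rough orthogonal samples
energies = EnergyNet()(f1, f2, candidates)
probs = torch.softmax(-energies, dim=0)  # multimodal distribution over rotations
best = candidates[probs.argmax()]
```

A joint estimate over many images can then pick, for each pair, the mode that is globally consistent rather than the locally most likely one.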
Specifying tasks with videos is a powerful technique for acquiring novel and general robot skills. However, reasoning over mechanics and dexterous interactions can make it challenging to scale learning of contact-rich manipulation. In this work, we focus on the problem of visual non-prehensile planar manipulation: given a video of an object in planar motion, find contact-aware robot actions that reproduce the same object motion. We propose a novel architecture, Differentiable Learning for Manipulation (\ours), which combines video decoding neural models with priors from contact mechanics by leveraging differentiable optimization and finite-difference-based simulation. Through extensive simulated experiments, we investigate the interplay between model-based techniques and modern deep learning approaches. We find that our modular and fully differentiable architecture performs better than learning-only methods on unseen objects and motions. \url{https://github.com/baceituno/dlm}.
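The finite-difference trick at the heart of such a pipeline fits in a few lines. In this toy sketch, `simulate` is a hypothetical one-step stand-in for a contact-mechanics simulator, not the paper's model; the point is only that central differences give a usable gradient through a black-box simulator:

```python
import numpy as np

def simulate(action, x0=0.0):
    """Hypothetical one-step planar 'simulator': maps a contact action to an
    object displacement. Stands in for a real contact-mechanics simulator."""
    return x0 + np.tanh(action)

def finite_diff_grad(f, action, eps=1e-4):
    """Central finite differences: a gradient estimate for a simulator that
    is not natively differentiable."""
    return (f(action + eps) - f(action - eps)) / (2 * eps)

# Gradient descent on the action so the simulated motion matches a target.
target, action = 0.6, 0.0
for _ in range(100):
    err = simulate(action) - target
    action -= 0.5 * err * finite_diff_grad(simulate, action)
print(action, simulate(action))  # action converges so simulate(action) ~ 0.6
```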
Monitoring water is a complex task due to its dynamic nature, added pollutants, and land build-up. The availability of high-resolution data from Sentinel-2 multispectral products makes remote sensing applications feasible. However, overutilizing or underutilizing the product's multispectral bands can lead to inferior performance. In this work, we compare the performance of ten of the thirteen bands available in a Sentinel-2 product for water segmentation using eight machine learning algorithms. We find that the shortwave infrared bands (B11 and B12) are the most effective for segmenting water bodies: B11 achieves an overall accuracy of $71\%$ while B12 achieves $69\%$ across all algorithms on the test site. We also find that the Support Vector Machine (SVM) is the best-suited algorithm for single-band water segmentation, achieving an overall accuracy of $69\%$ across the tested bands on the given test site. Finally, to demonstrate the effectiveness of choosing the right amount of data, we use only B11 reflectance data to train an artificial neural network, BandNet. Even with a basic architecture, BandNet is comparable to known semantic and water segmentation architectures, achieving $92.47$ mIoU on the test site. BandNet requires only a fraction of the time and resources to train and run inference, making it suitable for deployment in web applications that monitor water bodies in localized regions. Our codebase is available at https://github.com/IamShubhamGupto/BandNet.
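As a hedged illustration of single-band segmentation (the reflectance values below are synthetic stand-ins, not the paper's data; water strongly absorbs shortwave infrared, hence the low B11 values for water pixels), an SVM can be trained on B11 reflectance alone:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic per-pixel B11 reflectance with water / non-water labels.
rng = np.random.default_rng(0)
b11 = np.concatenate([rng.normal(0.05, 0.02, 500),    # water: low SWIR reflectance
                      rng.normal(0.25, 0.08, 500)])   # land: higher SWIR reflectance
labels = np.concatenate([np.ones(500), np.zeros(500)])

X_train, X_test, y_train, y_test = train_test_split(
    b11.reshape(-1, 1), labels, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```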
In this paper, we discuss an imitation-learning-based method for reducing the calibration error of a mixed reality system consisting of a vision sensor and a projector. Unlike a head-mounted display, in this setup augmented information is presented to a human subject by projecting the scene into the real world. Inherently, the camera and projector need to be calibrated as a stereo pair to project accurate information in 3D space. Previous calibration processes require multiple recording and parameter-tuning steps to achieve the desired calibration, which is usually a time-consuming process. To avoid such tedious calibration, we train a CNN model to iteratively correct the extrinsic offset given a QR code and a projected pattern. We discuss the overall system setup, data collection for training, and results of the auto-correction model.
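A minimal sketch of the iterative correction loop follows; the network architecture, input shapes, and the 6-DoF offset parameterization are all assumptions for illustration, not the paper's model:

```python
import torch
import torch.nn as nn

class OffsetNet(nn.Module):
    """Predicts a small extrinsic offset from an image of the projected
    pattern and QR code (hypothetical architecture)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(),
                                  nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(8, 6)  # 3 rotation + 3 translation offsets

    def forward(self, img):
        return self.head(self.conv(img))

net = OffsetNet()
extrinsics = torch.zeros(6)           # current estimate (axis-angle + translation)
for _ in range(5):                    # iterative auto-correction
    img = torch.randn(1, 3, 64, 64)   # stand-in for the captured camera frame
    extrinsics = extrinsics - net(img).squeeze(0).detach()
print(extrinsics)
```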
Language-conditioned policies allow robots to interpret and execute human instructions. Learning such policies requires a substantial investment in time and compute resources. Still, the resulting controllers are highly device-specific and cannot easily be transferred to a robot with a different morphology, capability, appearance, or dynamics. In this paper, we propose a sample-efficient approach for training language-conditioned manipulation policies that allows for rapid transfer across different types of robots. By introducing a novel method, namely Hierarchical Modularity, and adopting supervised attention across multiple sub-modules, we bridge the divide between modular and end-to-end learning and enable the reuse of functional building blocks. In both simulated and real-world robot manipulation experiments, we demonstrate that our method outperforms the current state of the art and can transfer policies across 4 different robots in a sample-efficient manner. Finally, we show that the functionality of learned sub-modules is maintained beyond the training process and can be used to introspect the robot's decision-making process. Code is available at https://github.com/ir-lab/ModAttn.
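One way to picture supervised attention across sub-modules is the toy policy below; the module layout, shapes, and routing labels are assumptions, not the paper's architecture. A controller mixes sub-module outputs with attention weights, and the weights themselves receive supervision so that each instruction routes through the intended functional block:

```python
import torch
import torch.nn as nn

class ModularPolicy(nn.Module):
    def __init__(self, dim=32, n_modules=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_modules))
        self.attn = nn.Linear(dim, n_modules)

    def forward(self, instr_emb, obs):
        weights = torch.softmax(self.attn(instr_emb), dim=-1)      # (B, n_modules)
        outs = torch.stack([b(obs) for b in self.blocks], dim=1)   # (B, n, dim)
        action = (weights.unsqueeze(-1) * outs).sum(dim=1)         # weighted mix
        return action, weights

policy = ModularPolicy()
instr, obs = torch.randn(4, 32), torch.randn(4, 32)
target_block = torch.tensor([0, 1, 2, 0])  # which block each instruction should use
action, w = policy(instr, obs)
# Supervised attention: penalize routing to the wrong sub-module.
attn_loss = nn.functional.nll_loss(torch.log(w + 1e-8), target_block)
```

Because the blocks are separable, a sub-module trained on one robot can in principle be reused on another, which is the transfer property the paper exploits.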
Hearing-impaired people face many obstacles in communication and require an interpreter to comprehend what a person is saying. Despite constant scientific research, existing models still lack the ability to make accurate predictions. We therefore propose a deep learning model trained on American Sign Language (ASL) that takes ASL gestures as input and translates them into text. To achieve the translation, a Convolutional Neural Network (CNN) model and a transfer-learning model based on the VGG16 architecture are used. Accuracy improved from 94% with the CNN to 98.7% with transfer learning, a gain of almost 5 percentage points. An application with the deep learning model integrated has also been built.
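A minimal Keras sketch of the transfer-learning setup (the classifier head, class count, and hyperparameters are assumptions; the abstract only states that VGG16 is the backbone):

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Freeze VGG16's pretrained convolutional base and train a small head on
# ASL hand-sign images.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(26, activation="softmax"),  # one class per ASL letter (assumed)
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```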
Recent video+language datasets cover domains where the interaction is highly structured, such as instructional videos, or where the interaction is scripted, such as TV shows. Both of these properties can introduce spurious cues that models exploit instead of learning to ground language. In this paper, we present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or `soccer') highlight videos with transcribed live commentaries in English. As the course of a game is unpredictable, so are its commentaries, which makes them a unique resource for investigating dynamic language grounding. We also provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval, and play-by-play live commentary generation. Results show that SOTA models perform reasonably well on most tasks. We discuss the implications of these results and suggest new tasks for which GOAL can be used. Our codebase is available at: https://gitlab.com/grounded-sport-convai/goal-baselines.
Industries around the world must follow government rules and regulations to classify products when assessing duties and taxes for international shipments. The Harmonized System (HS) is the most standardized numerical method of classifying traded products among industry classification systems. A hierarchical ensemble model comprising a BERT transformer, NER, distance-based approaches, and knowledge graphs has been developed to address the scalability, coverage, nuance-capturing, automation, and auditing requirements of classifying unknown text descriptions according to the HS method.
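As a toy illustration of the distance-based component alone (TF-IDF stands in for a BERT encoder, and the three-entry HS catalog is purely illustrative), an unknown description can be matched to its nearest known HS entry:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Illustrative HS catalog; a production system would hold the full nomenclature.
catalog = {
    "6109.10": "t-shirts, singlets and other vests of cotton, knitted",
    "8517.12": "telephones for cellular networks or other wireless networks",
    "9503.00": "tricycles, scooters and similar wheeled toys; dolls",
}
codes, texts = list(catalog), list(catalog.values())
vec = TfidfVectorizer().fit(texts)
index = NearestNeighbors(n_neighbors=1).fit(vec.transform(texts))

query = "knitted cotton t-shirts, short sleeve"
_, idx = index.kneighbors(vec.transform([query]))
print("predicted HS code:", codes[idx[0][0]])
```

In the full ensemble, such a retrieval score would be combined with the transformer, NER, and knowledge-graph signals before a final, auditable decision.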